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Abstract 


Background: A new respiratory infectious epidemic, severe acute respiratory syndrome (SARS), 
broke out and spread throughout the world. By now the putative pathogen of SARS has been 
identified as a new coronavirus, a single positive-strand RNA virus. RNA viruses commonly have a 
high rate of genetic mutation. It is therefore important to know the mutation rate of the SARS 
coronavirus as it spreads through the population. Moreover, finding a date for the last common 
ancestor of SARS coronavirus strains would be useful for understanding the circumstances 
surrounding the emergence of the SARS pandemic and the rate at which SARS coronavirus diverge. 


Methods: We propose a mathematical model to estimate the evolution rate of the SARS 
coronavirus genome and the time of the last common ancestor of the sequenced SARS strains. 
Under some common assumptions and justifiable simplifications, a few simple equations 
incorporating the evolution rate (K) and time of the last common ancestor of the strains (Tg) can 
be deduced. We then implemented the least square method to estimate K and T, from the dataset 
of sequences and corresponding times. Monte Carlo stimulation was employed to discuss the 
results. 


Results: Based on 6 strains with accurate dates of host death, we estimated the time of the last 
common ancestor to be about August or September 2002, and the evolution rate to be about 0.16 
base/day, that is, the SARS coronavirus would on average change a base every seven days. We 
validated our method by dividing the strains into two groups, which coincided with the results from 
comparative genomics. 


Conclusion: The applied method is simple to implement and avoid the difficulty and subjectivity 
of choosing the root of phylogenetic tree. Based on 6 strains with accurate date of host death, we 
estimated a time of the last common ancestor, which is coincident with epidemic investigations, 
and an evolution rate in the same range as that reported for the HIV-I virus. 
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Background 

A new respiratory infectious epidemic, severe acute respira- 
tory syndrome (SARS), broke out and spread throughout 
the world, affecting over 8,000 individuals in 32 coun- 
tries[1,2]. In response to this outbreak, a global network 
of international collaborating laboratories was immedi- 
ately sponsored and established by World Health Organi- 
zation (WHO) to facilitate the identification of the 
causative agent of SARS. By now the putative pathogen of 
SARS has been identified, by experimental proof and by 
Koch's postulates, as a new coronavirus, a single positive- 
strand RNA virus [3-5]]. The whole genome of SARS coro- 
navirus was first sequenced by the British Columbia Cen- 
tre for Disease Control (CDC) in Canada on 23, April 
2003 [6], and subsequently a total of 16 SARS coronavirus 
strains isolated from Hanoi, mainland China, Hong 
Kong, Singapore, and Taiwan were sequenced within 
short time[7,8]. Phylogenetic analysis and comparative 
genomic studies based on these genomic sequences indi- 
cate that the SARS coronavirus is distinct from any of the 
previously characterized coronaviruses. Epidemiological 
investigations further indicate the SARS coronavirus 
strains may be divided into two different genotypes[9]. 


RNA viruses commonly have a high rate of genetic muta- 
tion, by which the viruses escape from host defence and 
evolve into novel viral strains. It is therefore important to 
know the mutation rate of the SARS coronavirus as it 
spreads through the population. Moreover, finding a date 
for the last common ancestor of SARS coronavirus strains 
would be useful for understanding the circumstances sur- 
rounding the emergence of the SARS pandemic and the 
rate at which SARS coronavirus diverge. 


Many attempts have been made to extrapolate the age of 
the common ancestor of sequenced genomes. Most of 
them are based on accurate phylogenetic tree reconstruc- 
tions, which demand a large amount computation, 
because of their application of the maximum likelihood 
strategy. Common for these methods is that it is critical to 
choose a sequence as the root of the phylogenetic tree. 
Korber et al. [10] implemented a parsimonious strategy, 
which used the consensus sequence including the most 
common bases appearing in strains as the ancestral 
sequence. 


Table |: Dates of hosts' death 
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Methods 

Among the 16 full-length SARS coronavirus genomes, we 
selected 6 strains for which the accurate date of host death 
is known, and on which our modelling was based. Our 
model performed calculation under two hypotheses, 
which are commonly adopted and have lead to accurate 
prediction in the study of HIV-1 virus [10]: first, nucle- 
otide variation of these strains occurred by independent 
mutations at random positions in a single ancestral 
sequence; second, there exist a molecular clock and a con- 
stant rate of evolution. In addition, we simplified the cal- 
culation by neglecting trivial non-linear effects of multi- 
mutation for a base, i.e. there has only been one mutation 
for a base at a specific position of all the sequences during 
SARS infection time. This simplification can be justified 
by further discussion (see Additional file: 1). 


For an ancestor sequence S, of a strain S, we can deduce 
for the assumptions above that 


E(D(So, S)) * K(T - To), 


where D(S,, S) is difference of the two sequences (as 
depicted by Hamming distance), T, is the date of the last 
ancestor, T is the date of host death (as an estimate of 
sampling date), and K is the evolution rate constant. The 
formula gives the expectation of sequence differences in 
proportion to the time of evolution. 


If S, is the last common ancestor of S and S', then we have 
E(D(S,S')) = E(D(So,S)) + E(D(S>,S')) (Fig. 1(a)). The 
equation takes this form under the simplification that 
along the total of the infection paths of the two sequences, 
mutation at any specific point of the sequences could, at 
most, only occur once. Thus E(D(S,S')) = K(T + T' - 2T9). 


The last common ancestor Sp of all the sequences is the 
root of the hidden phylogenetic tree with the strains as 
nodes. From the time To, the sequences should at least 
evolve along two different routes. Therefore, there should 
be a partition of the strains into B and B' such that every 
pair of strains S e B and S' € B' should share the root of 
the tree as their last common ancestor (Fig. 1(b)), i.e., for 
each pair <S, S'>, E(D(S,S')) should be linear to (T + T') 


ID Strain Date of host death Date form Feb. 22 
| BjO| 03-08-2003 13 

2 Bjo2 03-08-2003 13 

3 GZO| 02-10-2003 -13 

4 SIN2500 03-14-2003 19 

5 TOR2 03-05-2003 10 

6 US 03-29-2003 34 


Page 2 of 6 


(page number not for citation purposes) 


BMC Infectious Diseases 2004, 4 


Table 2: Grouping of the strains 
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Si sj D(S;, S) To*(i, j) Annotation 

GZOl BJO2 55 -172 Best Division Gl 
GZOl TOR2 53 -167 Best Division Gl 
GZOl US 56 -165 Best Division Gl 
GZOl SIN2500 53 -163 Best Division Gl 
GZOI BjOl 49 -153 Best Division Gl 
BjO2 TOR2 24 -64 Gl 
Bjo2 US 27 -61 Gl 
BJO2 SIN2500 24 -59 Gl 
BJO! Bjo2 16 -37 Gl 
BJO! TOR2 14 -32 Gl 
BJO! US 17 -30 Gl 
BJO! SIN2500 14 -28 Gl 
TOR2 US 7 0 G2 
SIN2500 TOR2 4 2 G2 
SIN2500 US 7 5 G2 


Note: The best division is shown to the top, where one group include GZO1 and the other include the other strains. And from the time of the last 
common ancestor T,*(i, j), the strains can be classified into G, = {GZ01,BJ01,BJ02} and G, = {TOR2,US,SIN2500}. 


BO 


Si 
a 4 B’ 
(a) (b) 


Figure | 
Phylogenetic Tree a) For two strains; b) For several strains, 
these can be divided into two groups from the last common 
ancestor. 


with same parameter Tp. Therefore, we can implement the 
least square method to estimate K and Ty from the dataset. 


Since the real partition cannot be known in advance, we 
carried out calculations for all of the possible partitions of 
these 6 strains. For each division we use the estimated K to 
calculate the possible T,(S,S') of each sequence pair. The 
division with the minimum variance of Tp(S,S') is taken as 
our best solution to the problem, and the corresponding 
K as an estimation of the mutation rate. 


To analyze how the parameters affect the results and sup- 
port our fitting method, the Monte Carlo method was 
employed. At first, we produced a phylogenetic tree (See 


Fig. 3(c)) and a table of parameters (See Table 3) includ- 
ing the evolution rate and the times of the sequences. 
From the time of the last common ancestor S, of the other 
sequences, every base of a given sequence has the possibil- 
ity to mutate over time according to the given evolution 
rate. So the other sequences, included intermediate 
sequences (I) and final sequences (F), can be obtained in 
steps in the stimulation according to the given phyloge- 
netic tree and the time parameters. After the sequences 
were obtained, we used our fitting method to get the evo- 
lution rate, without including the hidden parameters. By 
analysis of the estimated K from the data, we can get to 
know how the parameters affect our fitting results and the 
quality of our method. 


Results 

Of the 16 SARS coronavirus strains submitted to Genbank 
before June, 2003, 6 had accurate date of host death 
recorded. We chose these 6 to estimate the last common 
ancestor and the mutation rate of the SARS coronavirus 
(Table 1). We performed the calculation, and the fitting 
result of the best division (See Table 2) is shown in Fig. 2, 
including the differences between sequences D(S,S') ver- 
sus the time factor (T + T'). The evolution rate K was esti- 
mated to be 0.16 base/day, which is similar to the 
reported evolution rate of HIV-1 virus [10]. The date of the 
last common ancestor T, was found to be about August or 
September, 2002, which is also in accordance with the 
epidemic investigations finding that the first verifiable 
SARS case was reported as early as on November 11, 2002. 


We validated our estimation of the evolutionary rate by 


grouping strains according to the estimated date of their 
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The linear relation between D(S,S') and (T+T') The parame- 
ters were estimated from the best division of 6 strains, 
where K is the evolution rate (base/day) and Tg is the time 
(day) of the last common ancestor. 


pair wise last common ancestor. Applying the estimated K 
= 0.16, we can determine a date T)*(i, j) of the last com- 
mon ancestor for each pair <S;, S;> by E(D(S,S)) = K(T; + 
T,- 2Ty*(i, j)) [Table 2], and then divide the 6 into two 
groups, G, = {BJOl, BJO2, GZ01} and G, = 
{TOR2,SIN2500,US}. It is apparent that every two mem- 
bers in G, have a last common ancestor with a date T,* (i, 
j) > 0, while every two members in G, have corresponding 
To*(i, j) < 0. This would imply that the strains in G, have 
a more recent last common ancestor than those in G,. This 
partition of strains was supported by Ruan et al [9]. 


Discussion 

Analysis by Monte Carlo Method was employed to test 
our fitting method and explain why the error of the evolu- 
tion rate and time of last common ancestor was so large in 
our prediction. In a simulation of the simplified evolution 
model, sequences were generated according to a given 
phylogenetic tree, with parameters including evolution 
rate and times for each sequence. Two sets of parameters 
were used for a common phylogenetic tree, the evolution 
rate kept constant while time parameters differed (See Fig. 
3(c) and Table 3). In model 1 there is a narrow time dis- 
tribution two month of final sequences, while model 2 
had a wider time distribution of five months. 


Hundreds of iterations of sequence data from the stimula- 
tion were given according to the parameters. For each 
result, we could get estimated parameters by our fitting 
method. The estimated K distribution of the results 
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(shown in Fig 3) is in support of our fitting method, as in 
both models the estimates for the evolutionary rate con- 
verged on the set parameter (0.2). Model 2 with wide time 
distribution had a narrower distribution of K, which indi- 
cates the fitted parameter has a smaller error. The differ- 
ence between the two models hints a narrow sampling 
time window as a partially explanation of the large error 
on the estimated K for the real data. 


Ideally, an estimation of evolution rate and the date of last 
ancestor for the SARS coronavirus should be based on 
sampling dates, with possible adjustments for culturing 
time and conditions. As such data were neither included 
in the submissions to Genbank, nor obtainable by direct 
contacts to the sequencing labs, we were left to choose 
between less ideal age estimates for the strains, such as 
date of host death, sequencing date, or submission date to 
Genbank. Sequencing dates were no more available than 
sampling dates, and for some groups several sequences 
were submitted to Genbank on the same date. In addition 
large part of the GenBank sequence were submitted long 
after June 2003, when no or very few SARS patient were 
available for sampling, also rendering submission date a 
not very accurate estimate for strain time. This basically 
left us with little other choice than to accept the date of 
host death as the most accurate available estimate for the 
age of each strain. Assuming that in most cases samples 
were taken a few days before to just after the death of the 
host, we think these dates represent acceptable, though 
not ideal, estimates of the endpoint of strain time. 


In summary, certain inherent features of the situation 
around the SARS epidemic prevented our method from 
rendering more accurate estimates. First, as national and 
international efforts fortunately succeeded in stemming 
the spread of fledgling epidemic by summer 2003, all the 
samples used to obtain the 16 sequences were collected 
within a relatively short period of time (two months), 
which makes the error of D(So, S,) is relative large. Second, 
because the date of host death is not good reflection of 
real time of sequences, the error of time is quite large. 
Third, as useful time data for the submitted sequences 
were scarce, the subset of sequences available for model- 
ling was too small. Finally, as data on pre-sequencing cul- 
turing times and conditions have not been made 
available, differences in evolutions rates between in vivo 
and in vitro conditions cannot be estimated, and the basic 
assumption, that only a constant evolution rate may not 
be completely valid. A more accurate model considering 
two evolution rate parameters may produce a more accu- 
rate estimation, particularly on a larger dataset with accu- 
rate sampling and sequencing times. 
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Figure 3 
Estimated K for Monte Carlo Simulation The distribution of estimated K is shown in a) and b): a) Model |; b) Model 2. The 


common phylogenetic tree is shown in c) 


Table 3: Parameters in the Monte Carlo stimulation 


K To Ti Tr Ty Ti Tr Trp Tr3 Tra Tes Tr¢ 

base/day Day day day day day day day day day day day 

Model | 0.2 -123 -100 -30 -60 -40 19 10 34 13 13 -13 

Model 2 0.2 -180 -120 0 -90 -60 60 30 0 -30 -60 -90 
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Conclusions 

We have proposed a mathematical model to estimate the 
evolution rate of the SARS coronavirus genome as well as 
the time of the last common ancestor of the various SARS 
coronavirus strains. The method is simple to implement 
and avoids the difficulty and subjectivity of choosing the 
root of phylogenetic tree. Based on 6 strains with accurate 
dates of host death, we estimated a time of the last 
common ancestor, which is coincident with epidemic 
investigations, and an evolution rate in the same range as 
that reported for the HIV-1 virus. 
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